The Data

Rows:

Each row is a room listed on Airbnb in New York.

Columns:

  1. neighbourhood_group: The neighborhood group of the listed room
  2. neighbourhood: The neighborhood of the listed room
  3. latitude: The latitude coordinate of the listed room
  4. longitude: The longitude coordinate of the listed room
  5. room_type: The room type of the listed room
  6. price: The price per night of the listed room
  7. minimum_nights: The minimum number of nights required to rent the listed room
  8. number_of_reviews: The total number of reviews for the listed room
  9. reviews_per_month: The number of reviews per month for the listed room
  10. calculated_host_listings_count: The number of listings held by the host of the listed room
  11. availability_365: The number of days per year the listed room is available
  12. number_of_reviews_ltm: The number of reviews for the listed room in the last twelve months
  13. license: The license of the listed room
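Before plotting, it can help to confirm that the data frame contains the columns the code refers to. The sketch below assumes the data is already loaded into `df`; the three-row `df` here is a made-up stand-in, and `required_cols` lists only the columns used by the plots and maps in this document.

```r
# Toy stand-in for the real listings data (values are illustrative only).
df <- data.frame(
  neighbourhood_group = c("Manhattan", "Brooklyn", "Queens"),
  neighbourhood = c("Harlem", "Bushwick", "Astoria"),
  latitude = c(40.81, 40.70, 40.76),
  longitude = c(-73.95, -73.92, -73.92),
  room_type = c("Private room", "Entire home/apt", "Shared room"),
  price = c(90, 150, 45),
  availability_365 = c(120, 0, 300),
  calculated_host_listings_count = c(1, 2, 1),
  number_of_reviews_ltm = c(4, 10, 0)
)

# Columns referenced by the plotting and mapping code.
required_cols <- c("neighbourhood_group", "neighbourhood", "latitude",
                   "longitude", "room_type", "price", "availability_365",
                   "calculated_host_listings_count", "number_of_reviews_ltm")

# Any name returned here is absent from df and would break a later plot.
missing_cols <- setdiff(required_cols, names(df))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```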

Our exploration starts by checking whether availability differs by neighborhood group.

# ddply() comes from plyr; the plotting functions come from ggplot2
library(plyr)
library(ggplot2)
# Mean availability per neighbourhood group
neighbourhoodgroup_availability <- ddply(df,~neighbourhood_group,summarise,avg=mean(availability_365))
ggplot(neighbourhoodgroup_availability, aes(x = neighbourhood_group, y = avg, fill = neighbourhood_group))+ geom_bar(stat="identity") + xlab("Neighbourhood Group")+ylab("Average availability per year")+scale_fill_discrete("Neighbourhood group")+ggtitle("Average availability per year against neighbourhood group") + labs(subtitle = "plot 1")

Manhattan and Brooklyn have the lowest average availability per year, which we read as the highest average booking rate. This could be an effect of the neighborhood group itself or of other features. We therefore look at the proportion of room types per neighborhood group to see whether the room-type mix plays a role in the average booking rate.
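For readers without plyr, the per-group mean that `ddply()` computes can be sketched in base R with `aggregate()`. The five-row `df` below is made up purely for illustration and is not taken from the listings data.

```r
# Made-up rows, for illustration only.
df <- data.frame(
  neighbourhood_group = c("Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"),
  availability_365 = c(100, 200, 50, 150, 300)
)

# One row per neighbourhood group with the mean availability.
avg_availability <- aggregate(availability_365 ~ neighbourhood_group,
                              data = df, FUN = mean)
# Rename the value column to match the "avg" column used in the plot.
names(avg_availability)[2] <- "avg"
print(avg_availability)
```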

neighbourhoodgroup_roomType <- ddply(df,~neighbourhood_group + room_type,summarise,count=length(room_type))
ggplot(neighbourhoodgroup_roomType, aes(x = neighbourhood_group, y = count, fill = room_type))+geom_bar(stat = "identity", position = "fill")+xlab("Neighbourhood Group")+ylab("Proportion of each room type")+scale_fill_discrete("Room type")+ggtitle("Proportion of room type per neighbourhood group") + labs(subtitle = 'plot 2')

Manhattan and Brooklyn have the highest proportion of entire homes/apartments. We assume that Manhattan and Brooklyn are booked more (lower availability per year) because of higher demand for specific room types, in this case entire homes/apartments. We test this assumption by plotting the average availability per room type.
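The shares that `position = "fill"` draws in plot 2 can also be computed as numbers. A minimal base-R sketch using `table()` and `prop.table()`, on made-up rows rather than the real listings:

```r
# Made-up rows, for illustration only.
df <- data.frame(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan", "Manhattan",
                          "Brooklyn", "Brooklyn"),
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Entire home/apt", "Private room")
)

# Listing counts per (neighbourhood group, room type).
counts <- table(df$neighbourhood_group, df$room_type)

# margin = 1 normalises each row, so every neighbourhood group sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```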

n <- ddply(df,~room_type,summarise,avgAvailability=mean(availability_365))
ggplot(n,aes(x=room_type,y=avgAvailability,fill=room_type)) + geom_bar(stat = "identity") + ylab('Average availability per year') + xlab('Room type') + scale_fill_discrete("Room type") + labs(subtitle = 'plot 3')+ggtitle("Average availability by room type")

We can see that entire homes/apartments and private rooms are the most booked on average (lowest average availability), which might explain why Manhattan and Brooklyn have the lowest availability. We look closer by inspecting the availability per year for each room type across all neighborhood groups.

ggplot(df,aes(x=room_type,y=availability_365,fill = room_type)) + geom_boxplot() + facet_wrap(~neighbourhood_group)+theme(axis.text.x = element_blank(),axis.title.x = element_blank()) + ylab('Availability per year') + labs(fill='Room type') +  labs(subtitle = 'plot 4')+ggtitle("Availability by room type across neighborhood group")

Private rooms and entire homes/apartments perform best (lowest availability) across all neighborhood groups. Since Brooklyn and Manhattan have the largest proportion of entire homes/apartments and private rooms (plot 2), and these room types have lower availability (plot 4), this would explain the lower average availability in those two groups.

We’ll now inspect the relation between availability per year and price.

ggplot(df,aes(x = price, y = availability_365)) + geom_point() + geom_smooth() + xlab('Price') + ylab('Availability per year') +labs(subtitle = 'plot 5')+ggtitle("Availability against price")

ggplot(df,aes(x = price,y = availability_365))+ geom_point() + geom_smooth() + facet_wrap(~neighbourhood_group) + xlab('Price')+ylab('Availability per year') + labs(subtitle = 'plot 6')+ggtitle("Availability against price by neighborhood group")

In every neighborhood group except Staten Island, lower-priced listings tend to have lower availability, which we read as more bookings.
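A rank correlation per group is one way to put a number on the trend that the smoothed lines suggest. The sketch below is an assumption about how one might check it, not part of the original analysis; the group names and values are invented.

```r
# Invented data: in "Group A" availability rises with price,
# in "Group B" it falls with price.
df <- data.frame(
  neighbourhood_group = rep(c("Group A", "Group B"), each = 5),
  price = c(50, 100, 150, 200, 250, 50, 100, 150, 200, 250),
  availability_365 = c(20, 60, 90, 200, 300, 300, 250, 180, 90, 10)
)

# Spearman's rho between price and availability within each group.
# Positive rho: cheaper listings tend to have lower availability.
rho_by_group <- sapply(
  split(df, df$neighbourhood_group),
  function(d) cor(d$price, d$availability_365, method = "spearman")
)
print(rho_by_group)
```

Spearman's rho is used here instead of Pearson's correlation because it only assumes a monotone relation, which suits skewed price data.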

neighbourhoodgroup_price <- ddply(df,~neighbourhood_group,summarise,avg=mean(price)) 
ggplot(neighbourhoodgroup_price, aes(x = neighbourhood_group, y = avg, fill = neighbourhood_group)) + geom_bar(stat="identity")+ xlab("Neighbourhood Group") + ylab("Average Price") + scale_fill_discrete("Neighbourhood group") + ggtitle("Average price against neighbourhood group") + labs(subtitle ='plot 7')

Manhattan and Brooklyn are the most expensive neighborhood groups on average. This could be explained by their higher proportion of entire homes/apartments and private rooms (plot 2) and the lower availability of these room types (plot 4): higher demand would push prices up. We next inspect the price per neighborhood group for each room type.

ggplot(data = df,aes(x = neighbourhood_group,y = price,fill = room_type)) + geom_boxplot() + xlab('Neighbourhood group') + ylab('price') + facet_wrap(~room_type) + theme(axis.text.x = element_text(angle = 90)) +guides(fill = "none")+ggtitle("Price by neighborhood group across room type")+labs(subtitle ='plot 8')

Entire homes/apartments are more expensive than the other room types in all neighborhood groups (with a minor exception).

We want to see whether Manhattan is more expensive because of a few individual neighborhoods.

t <- table(df$neighbourhood)
top_neighbourhoods <- c('Bedford-Stuyvesant','Bushwick','Crown Heights','Williamsburg','Harlem','Midtown','Upper West Side',"Hell's Kitchen",'Upper East Side','East Village','Chelsea','Lower East Side')
df_sub <- subset(df,neighbourhood %in% top_neighbourhoods)
n <- ddply(df_sub,~neighbourhood_group + neighbourhood ,summarise,avgP = mean(price))
ggplot(n,aes(x = neighbourhood, y = avgP, fill = neighbourhood_group)) + geom_bar(stat = 'identity')+ xlab('Neighbourhood') + ylab('Average price') + theme(axis.text.x = element_text(angle = 90)) + labs(subtitle = 'plot 9') +ggtitle("Average price of the top neighborhoods")

The top neighborhoods in Manhattan have similar average prices, ranging from approximately 150 to 200. In Brooklyn, average prices are also similar between neighborhoods. The higher average price is therefore not caused by a few individual neighborhoods.
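The text does not say how the hard-coded neighborhood list was chosen. If it was meant to be the neighborhoods with the most listings (an assumption), it could be derived from the counts instead of typed by hand. A sketch on an invented vector, where `top_n` is a hypothetical parameter:

```r
# Invented neighbourhood column, for illustration only.
neighbourhood <- c("Harlem", "Harlem", "Harlem", "Bushwick", "Bushwick",
                   "Midtown", "Chelsea", "Chelsea", "Chelsea", "Chelsea")

# Number of listings per neighbourhood.
counts <- table(neighbourhood)

# Keep the names of the top_n most frequent neighbourhoods.
top_n <- 2
top_neighbourhoods <- names(sort(counts, decreasing = TRUE))[seq_len(top_n)]
print(top_neighbourhoods)
```

The resulting vector can be passed to `subset(df, neighbourhood %in% top_neighbourhoods)` in the same way as the hand-written list.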

t <- table(df$neighbourhood)
top_neighbourhoods <- c('Bedford-Stuyvesant','Bushwick','Crown Heights','Williamsburg','Harlem','Midtown','Upper West Side',"Hell's Kitchen",'Upper East Side','East Village','Chelsea','Lower East Side')
df_sub <- subset(df,neighbourhood %in% top_neighbourhoods)
df11 <- ddply(df_sub,~ room_type + neighbourhood_group + neighbourhood,summarise,count = length(neighbourhood))
ggplot(df11,aes(x = neighbourhood,y= count, fill = neighbourhood_group)) + geom_bar(stat = 'identity') + facet_wrap(~room_type) + theme(axis.text.x = element_text(angle=90)) + labs(fill='Neighbourhood group') + xlab('Neighbourhood') + ylab('Count')+ labs(subtitle = 'plot 10')+ggtitle("Count of room types in the top neighborhoods")

Most listings in the top neighborhoods are private rooms and entire homes/apartments; hotel rooms and shared rooms are rare.

To continue our exploration, we check whether prices are driven by a small number of hosts with many listings rather than by the neighborhoods.

ggplot(df,aes(x=calculated_host_listings_count,y=price)) + geom_point() + geom_smooth() + xlab('Calculated host listing count') + ylab('Price')+ labs(subtitle = 'plot 11')+ggtitle("Price against host listing count")

There is no clear trend, and the same host listing count shows a wide variety of prices. Perhaps a clearer trend appears when we split the plot by room type.

ggplot(df,aes(x=calculated_host_listings_count,y=price)) + geom_point() + geom_smooth() + xlab('Calculated host listing count') + ylab('Price')+ labs(subtitle = 'plot 12')+ggtitle("Price against host listing count by room type")+facet_wrap(~room_type)

Again, there appears to be no clear pattern. We continue by checking whether the host listing count affects availability directly, rather than through price.

ggplot(df,aes(x=calculated_host_listings_count,y=availability_365)) + geom_point() + geom_smooth() + xlab('Calculated host listing count') + ylab('Availability per year')+ggtitle("Availability against host listing count")+labs(subtitle="plot 13")

Again, it is difficult to discern anything meaningful. We will go a step further and see if grouping by room type will reveal anything.

ggplot(df,aes(x=calculated_host_listings_count,y=availability_365)) + geom_point() + geom_smooth() + xlab('Calculated host listing count') + ylab('Availability per year') + facet_wrap(~room_type)+labs(subtitle="plot 14")+ggtitle("Availability against host listing count by room type")

Availability per year does not appear to be related to the calculated host listings count. We now check whether something else is affecting availability: the number of reviews in the last twelve months.

ggplot(df,aes(number_of_reviews_ltm,availability_365)) + geom_point() + geom_smooth() + xlab('Number of reviews last twelve months')+ylab('Availability per year')+labs(subtitle = "plot 15")+ggtitle("Availability against reviews in last twelve months")

Interestingly, availability per year increases as the number of reviews in the last twelve months increases. One possible reading is that Airbnb users write an excess of negative reviews when they are dissatisfied; another is that hosts falsify reviews in an attempt to compete for Airbnb users.

ggplot(df,aes(number_of_reviews_ltm,availability_365)) + geom_point() + geom_smooth() + facet_wrap(~room_type) +xlim(0,365)+
  ylab('Availability per year')+xlab('Number of reviews last twelve months')+labs(subtitle = "plot 16") + ggtitle("Availability against reviews in last twelve months across room type")

We can see that the trend above holds for all room types except shared rooms.

Map per feature

Room type

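  1. navy blue: Private room
  2. red: Entire home/apt
  3. black: Shared room
library(leaflet)
# levels = fixes the colour order; with a character domain, colorFactor() sorts the values alphabetically by default
pal <- colorFactor(c("navy", "red","black"), domain = NULL, levels = c("Private room", "Entire home/apt","Shared room"))
# Room types outside these levels (e.g. hotel rooms) are drawn in the default NA colour
# lng and lat are passed explicitly so leaflet does not have to guess the coordinate columns
map <- leaflet() %>% addTiles() %>%  setView(lat = mean(df$latitude),lng = mean(df$longitude), zoom = 10) %>% addCircleMarkers(data=df,lng = ~longitude,lat = ~latitude,radius = 0.1,color = ~pal(room_type))
map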

Availability per year

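  1. yellow: up to 100 days
  2. orange: 100 to 200 days
  3. red: more than 200 days
# include.lowest = TRUE keeps listings with 0 available days in the first bin instead of turning them into NA
df$aval_cat <- cut(df$availability_365,c(0,100,200,400),labels = c('<100','100-200','>200'),include.lowest = TRUE)
# levels = keeps the yellow/orange/red order; a character domain would be sorted alphabetically
pal <- colorFactor(c("yellow", "orange","red"), domain = NULL, levels = c('<100','100-200','>200'))
map <- leaflet() %>% addTiles() %>%  setView(lat = mean(df$latitude),lng = mean(df$longitude), zoom = 10) %>% addCircleMarkers(data=df,lng = ~longitude,lat = ~latitude,radius = 0.1,color = ~pal(aval_cat))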
map
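The map legends rely on `cut()` to bin a numeric column. By default `cut()` builds right-closed intervals and leaves a value equal to the lowest break outside every bin, so a listing with 0 available days becomes `NA` unless `include.lowest = TRUE` is set. A small base-R sketch with made-up values:

```r
# Made-up availability values, including both edge cases (0 and 100).
availability <- c(0, 50, 100, 150, 365)
bin_labels <- c("<100", "100-200", ">200")

# Default behaviour: intervals are (0,100], (100,200], (200,400].
default_bins <- cut(availability, c(0, 100, 200, 400), labels = bin_labels)

# include.lowest = TRUE closes the first interval on the left: [0,100].
closed_bins <- cut(availability, c(0, 100, 200, 400), labels = bin_labels,
                   include.lowest = TRUE)

print(data.frame(availability, default_bins, closed_bins))
```

Note that a value of exactly 100 falls in the first bin in both cases, because the intervals are closed on the right.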

Price per day

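  1. yellow: up to $100
  2. orange: $100 to $200
  3. red: more than $200
# Inf as the top break keeps prices above 1000 in the '>200' bin; include.lowest = TRUE keeps a price of 0 in the first bin
df$price_cat <- cut(df$price,c(0,100,200,Inf),labels = c('<100','100-200','>200'),include.lowest = TRUE)
# levels = keeps the yellow/orange/red order; a character domain would be sorted alphabetically
pal <- colorFactor(c("yellow", "orange","red"), domain = NULL, levels = c('<100','100-200','>200'))
map <- leaflet() %>% addTiles() %>%  setView(lat = mean(df$latitude),lng = mean(df$longitude), zoom = 10) %>% addCircleMarkers(data=df,lng = ~longitude,lat = ~latitude,radius = 0.1,color = ~pal(price_cat))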
map

To explore latitude and longitude, we plotted the listings on a map and coloured the points by the levels of each feature. There is a cluster where private rooms are concentrated, and it would be interesting to see what makes that area more appealing for private rooms than for entire homes. The price map shows clear clusters of red around Manhattan, confirming what the graphs established: it is the most expensive neighborhood group. There are also some yellow clusters indicating areas with cheaper prices, and zooming in suggests that listings near the airports tend to be cheaper.